Abstract

Extracting relations with high precision is still a challenge when
processing full-text databases. We propose an approach based on
co-occurrence analysis within each document, in which we use document
organization to improve the accuracy of relation extraction. This approach
is implemented in an R package called x.ent. Another facet of extraction
is the use of the extracted relations in a querying system for expert
end-users. Two datasets were used. One of them is of interest to
specialists of epidemiology in plant health; for this dataset, usage is
dedicated to plant-disease exploration through agricultural news. An
open-data platform exploits exports from x.ent and is publicly available.

Keywords: Data Sciences, Digital Humanities, Perl, R, Unsupervised 
Learning, Document Structure, End-User, Finite State Automata 


1 Introduction 

More than 90% of the information available throughout history has been
produced in the last 5 years only (see [29]). All kinds of data are
concerned, but if we consider data on the internet, people's usage is
mostly associated with textual data (e.g. sending a tweet or an email)
(see [32]), and on the internet 3% out of 48 billion URLs are indexed by
Google; it means 14 billion webpages are in textual format (see [?]). Even
in a database such as YouTube we can find text annotations for videos.
Text processing attracts more and more interest in database processing as
a way to sift out interesting pieces of information (see [?], [59]).
Natural Language Processing is naturally engaged in analytics, and for
many purposes relations make sense in many documents (see [?]), not only
bags of words, which have been widely studied and implemented for a long
time (see [?]).

Words are syntactically or semantically ambiguous, but framing the
information extraction task as relation extraction between named entities
can increase the accuracy of extraction, because named entities are less
ambiguous than noun phrases (see [58]). In many specialized domains, users
play an important role. The domain acts as a constraint on the lexical
universe, and end-users act as a validation process for the extracted
information. That is why involving users (i.e. domain experts) in
specifying what makes an extraction useful adds constraints on the
information extracted by a system.

In this sense, our proposed system takes into account hand-crafted
external resources and user-interface specifications defined by domain
experts. These resources are of two kinds: the first is a proto-ontology
of domain named entities; the second is a set of rules with lexical
markers, for cases where concepts do not fit the usual notion of named
entity (such as an opinion, a stage ...). Sometimes both approaches
(dictionaries and rules) can be mapped to the same concept. Once instances
of concepts are retrieved, we use heuristics on document architecture as
rules to reinforce an unsupervised approach based on co-occurrence
analysis. Our tool x.ent integrates some graphical facilities that help a
user explore the exported lists of extractions.

Part 2 (state of the art) presents the context of named entity recognition
and of relation extraction between named entities. Part 3 (methodology)
explains our approach to extracting relations in a document (datasets,
heuristics and algorithms). Finally, Part 4 (results) presents the
assessment of results and some means of exploring relations.

2 State of the art 

The field of "Information Extraction" has been defined as the
identification of pieces of information associated with real-world objects
such that they can be classified by type, for instance organization:org,
person:per, localisation:loc, protein:prot, disease:dis, prices:pri,
times:tim ...; it has recently been extended with other popular components
such as species, phenotypes and medicines. Here is an example:

"We performed exome sequencing in a family with <Crohn's disease:dis> (CD)
and severe autoimmunity, analysed immune cell phenotype and function in
affected and non-affected individuals, and performed in silico and in vitro
analyses of <cytotoxic T lymphocyte-associated protein 4:prot> (CTLA-4)
structure and function."

Entity recognition is the first stage, necessary for the other stages,
which may include: associating other numeric or qualitative information
about the context of a named entity, extracting relations between
entities, associating groups of relations occurring at the same time or in
the same location, and organizing a scenario of item-sets over time or by
causality. Some examples of relations are: involved_in, located_in,
part_of, married_to, sold_to, interact_with, causes_damage_to ... From the
previous example we can set the relation between the categories <prot> and
<dis> as "<prot> involved_in <dis>", and the identified instances
<cytotoxic T lymphocyte-associated protein 4:prot> and <Crohn's
disease:dis> fit well. At a higher level of extraction we should be able
to reconstruct a network or to figure out sets of events. But at this
stage, a mapped sequence of events that has not been validated
experimentally, or by an expert, should be considered only putative.

There are two families of approaches. The approach based on rules
hand-crafted by experts is considered more qualitative or symbolic (also
called knowledge-rich), and the machine learning approach is considered
more numeric or quantitative (also called knowledge-poor). In either case,
experts are involved in the loop to define resources (well-defined lexical
dictionaries, annotated files, detection rules). Some learning techniques,
such as distant learning and raw co-occurrence analysis, can avoid the
intervention of experts, but they do not achieve good accuracy in every
case.

Named entity recognition (NER) is considered a solved problem. The first
emblematic conferences on the topic were held between 1987 and 1997 to
identify terrorist activities in news (Message Understanding Conferences,
see [27]). More recent conferences were about molecular biology
information, organized by the BioCreative consortium in 2008, and about
information from news, at the Conference on Computational Natural Language
Learning (CoNLL) in 2003.

See Section 4.1 for the definition of the parameters (P or precision, R or
recall, F-score) used to assess the quality of extraction. The two main
approaches to recognizing named entities are:


• Pattern-based algorithms, where experts define hand-crafted rules.
Rule-based techniques often rely on automata theory (at least when
processing textual data). An automaton whose set of states contains only
a finite number of elements is called a finite-state machine (FSM). FSMs
are abstract machines consisting of a set of states (set $S$), a set of
input events (set $\Sigma$), a set of output events (set $Z$) and a state
transition function. The state transition function takes the current
state and an event as input and returns a new set of events as output
together with the next state. Hence, it can be seen as a function that
maps an ordered sequence of input events to a corresponding sequence, or
set, of output events, $\Sigma \to Z$. In the mathematical model that
gives the formal definition, a finite-state machine is a quintuple
$(\Sigma, S, s_0, \delta, F)$ where:


— $\Sigma$ is the input alphabet (a finite, non-empty set of symbols).

— $S$ is a non-empty set of states.

— $s_0$ is the initial state, an element of $S$.

— $\delta$ is the state transition function, $\delta : S \times \Sigma
\to S$ (for a non-deterministic finite-state automaton it is $\delta : S
\times \Sigma \to \mathcal{P}(S)$, i.e., $\delta$ returns a set of
states).

— $F$ is the set of final states, a (possibly empty) subset of $S$.
For deterministic and non-deterministic FSMs, it is conventional to
consider $\delta$ as a partial function, i.e. $\delta(q, x)$ does not have
to be defined for all combinations of $q \in S$ and $x \in \Sigma$. If an
FSM $M$ is in a state $q$, the next symbol is $x$ and $\delta(q, x)$ is
not defined, then $M$ can raise an error (i.e. reject the input). A
finite-state machine is a restricted Turing machine whose head only reads
and always moves forward, from left to right. An FSM can always be
represented by a graphical model in which states and transitions are
displayed. For instance, the following linear expression can recognize
some names of universities:

[A-Z][a-z]+([ ][A-Z][a-z]+)+[ ]University

Results could be "Paris Est University" or "New York City University".
A dictionary-based approach, as in LingPipe (see [12]) or LAITOR
(Literature Assistant for Identification of Terms co-Occurrences and
Relationships) (see [?]), can achieve a better score if entities are
already stored in external resources. Some systems can use rules with
predefined patterns using lexical tokens and grammatical conditions. Basic
rule-based systems rely on finite automata and regular languages (see
[?]). More complex systems use frames (see [?]).

• Statistical learning algorithms: HMM (see [23], [31], [2], [6]), MaxEnt
(see [?]), CRF (see [22]), regularized averaged perceptron. These
approaches require annotated files to learn the dependency probabilities
of the model. The IllinoisNER system achieves 90.6% F1 and SNER (Stanford
NER) 86.86% F1 on the CoNLL03 NER shared task data. IllinoisNER uses a
perceptron model and SNER a MaxEnt approach. HIT LTP NER also uses MaxEnt
and obtains 92.25% F1. OpenNLP NameFinder (see [22]) also uses a MaxEnt
approach.
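The finite-state machine definition given in the pattern-based item above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `FiniteStateMachine`, the toy machine `toy`, and the word classes `Cap` and `University` are hypothetical names introduced here, not part of x.ent or Unitex. The toy machine accepts the same names as the university expression quoted in the text, working on word classes instead of characters.

```python
import re
from typing import Dict, Iterable, Set, Tuple


class FiniteStateMachine:
    """Deterministic FSM given as the quintuple (alphabet, states, start, delta, finals)."""

    def __init__(self, alphabet: Set[str], states: Set[str], start: str,
                 delta: Dict[Tuple[str, str], str], finals: Set[str]) -> None:
        self.alphabet, self.states, self.start = alphabet, states, start
        self.delta, self.finals = delta, finals

    def accepts(self, symbols: Iterable[str]) -> bool:
        state = self.start
        for symbol in symbols:
            # delta is partial: an undefined (state, symbol) pair rejects the input.
            if (state, symbol) not in self.delta:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals


# Hypothetical toy machine: Cap, then one or more Cap, then "University".
toy = FiniteStateMachine(
    alphabet={"Cap", "University"},
    states={"q0", "q1", "q2", "q3"},
    start="q0",
    delta={("q0", "Cap"): "q1", ("q1", "Cap"): "q2",
           ("q2", "Cap"): "q2", ("q2", "University"): "q3"},
    finals={"q3"},
)


def word_class(word: str) -> str:
    """Map a word to an input symbol of the toy alphabet."""
    if word == "University":
        return "University"
    return "Cap" if re.fullmatch(r"[A-Z][a-z]+", word) else "other"


def is_university(name: str) -> bool:
    return toy.accepts(word_class(w) for w in name.split(" "))
```

With this sketch, `is_university("Paris Est University")` is accepted, while `is_university("Harvard University")` is rejected because no transition is defined from `q1` on the symbol `University`.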

General accuracy evaluation challenges give an F-score > 90% for the NER
task, while human assessment reaches about 97%. Relation extraction is
more difficult (generally F-score < 60%). The main basic techniques for
relation extraction are subdivided into two main approaches, as for entity
extraction:

• Exact dictionary-based chunking (see [?])

• Hand-crafted rule-based techniques (see [?])

And optimization approaches:

• Inductive logic programming (see [35], [?])

• Bayesian network analysis (see [56], [52], [?])

• Unsupervised learning, such as co-occurrence analysis (see [?])

• Semi-supervised learning, such as distant learning (see [?])
Among tools using predefined schemas, Pharmspresso (see [19]) aims at
extracting instances matching the schema "medicine association gene" in
full-text documents. Out of 178 gene mentions, Pharmspresso finds 78.1%,
and out of 191 medicine mentions it finds 74.4%. With the previously cited
schema, the tool finds 50.3% of associations. PPInterFinder (see [53]) is
a tool implemented to extract causal relationships between human proteins
in texts, relying on 11 schemas. It achieves a score of 66.05% on the
AIMed corpus and outperforms most other systems. The BioCreative II
challenge (see [38]) focused on the detection of molecule names and their
relationships; the overview highlights that one of the best scores
(F-score = 45%) is obtained by a system implementing lexical schemas about
relationships. OpenDMAP is one of these tools (see [?]). Among tools
implementing symbolic learning, the LIEP system reaches an 85.2% F-score
(recall 81.6%; precision 89.4%) with a training dataset of about 100
sentences from the Wall Street Journal. Distant learning gives a precision
of about 67% (in a lexical framework), which hardly varies with syntactic
assumptions (68%), for 1000 relationship instances concerning films,
geography, localisation and persons. A hybrid approach such as Sprout-Dare
gives a precision of 83% on medical full texts in German.

Among statistical learning approaches, the MaxEnt method has been widely
used to extract relationships between proteins, with a 45% F-score.
Unsupervised methods based on simple co-occurrence analysis obtain
respectable scores (but on abstracts). For instance, studies on instances
of a schema such as "gene associated to disease" report a detection
relevance in abstracts of 78.5% (precision) and 87.1% (recall) (with gene
recognition scores of P = 89% and R = 90.9%, and disease recognition
scores of P = 90% and R = 96.6%).
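To compare systems reported with precision and recall only against those reported with an F-score, the two values can be combined. The sketch below assumes the balanced F-score (harmonic mean of precision and recall); the paper defines its own parameters in Section 4.1, so this formula is an assumption here, and the function name `f_score` is hypothetical.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean); inputs and output share the same scale."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted above for the gene-disease co-occurrence study.
gene_disease_f = f_score(78.5, 87.1)
```

Under this assumption, the gene-disease co-occurrence figures correspond to a balanced F-score of about 82.6%.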

In the internet era, many resources are now available online. Generic
databases of common knowledge collect millions of facts, such as Freebase
(see [26]), where we can find a song/singer relation like Yesterday →
John Lennon, Paul McCartney. Freebase claims to register 3 billion
entities with links. Another famous database is Yago (see [31]), storing
10 million entities and 120 million facts. Other generic databases can be
mentioned; they are also parsed by those cited above, but individually
they can contain lists of interesting pieces of information (WordNet,
Wikipedia (see [28], [33])). Specific databases propose a summary of a
domain, like Gene Ontology (GO) in molecular biology or Agrovoc (created
by the Food and Agriculture Organization of the United Nations, FAO) in
agricultural sciences. These databases are not sufficiently exhaustive,
even if they can provide many relevant relations, as conferences such as
BioCreative show (see [21]).

3 Methodology 

3.1 Assumptions 

We make three assumptions and we will assess them with precision and
recall parameters. Our main assumption addresses the difficulty of
detecting relations when features do not occur only inside a sentence but
across sentences, and when a document has some topical organization with
titles, subtitles and paragraphs.

Assumption 1 Architecture of a document
Relevant items and their relationships can occur in different
architectural parts of a document.

Our second assumption is that relationships between named entities are
explicit and can be caught by shallow parsing of the tokens in the text.

Assumption 2 Shallow parsing
Named entities can be captured by shallow parsing.

A last assumption is that a user can assess the results:

Assumption 3 User needs

Items found in texts can be interpreted by an end-user and fit their usage
needs.

With these assumptions we try to infer all possible relationships
mentioned in a document. Hence the pipeline consists of document
segmentation followed by three steps:

1. reformatting and segmentation of a document into subparts;

2. named entity recognition and contextual information extraction;

3. relation extraction;

4. contextual information assignment (for instance: pest significance,
developmental stage, climate, location).

To describe the problem formally, we introduce the following definitions:

Definition 1 Text Unit and Entity

A text unit $U$ is a linked list which consists of words $W$ and entities
$E$. An entity can be a word or a set of consecutive words. Entities in a
text unit are labeled $E_1, E_2, \ldots$ according to their order, and
they take values that range over a set of entity types $C_E$.

Segmentation into text units is not always an easy task, because text
components can occur anywhere on a page without being linked sequentially
to the text units around them. Sometimes the conversion from PDF to ASCII
format merges one text unit into another. As an example from our dataset,
in Figure 1 we can see that a text unit should start at "Ble" and end with
"mais ils ne peuvent enrayer les fortes infestations". In this text unit,
$E_1$ = "Ble", $E_2$ = "ble", $E_3$ = "orge de printemps", $E_4$ = "pietin
echaudage", $E_5$ = "cereale", $E_6$ = "champignon", $E_7$ = "maladie",
$E_8$ = "mais", $E_9$ = "ray-grass", $E_{10}$ = "luzerne", $E_{11}$ =
"soja", $E_{12}$ = "champignon", $E_{13}$ = "maladie". There are two
kinds of named entity: $C_1 = \{E_1, E_2, E_3, E_5, E_8, E_9, E_{10}\}$
and $C_2 = \{E_4, E_6, E_7, E_{12}, E_{13}\}$. $C_1$ represents the "crop"
category and $C_2$ the "disease" category.

Definition 2 Relation 

A (binary) relation $R_{ij} = (E_i, E_j)$ represents the relation between
two entities, where $E_i$ is the first argument and $E_j$ is the second.
Such a relation can belong to a relation type in $C_R$.

As an example from our dataset, in Figure 1 we can see that the relevant
relations associate $E_1$ = "Ble" with entities of concept $C_2$, i.e.
$R = \{R_{1,4}, R_{1,6}, R_{1,7}, R_{1,12}, R_{1,13}\}$.

Definition 3 Types of relations between entities 

Let $C_R$ and $C_E$ denote the predefined sets of relations and of
entities, respectively.

Many kinds of entities can be found in $C_E$: named entities such as pests
or crops, but also, for instance, "developmental stage of crops",
"developmental stage of pests", or "kinds of damage". Several kinds of
relations can be found, for instance
$C_R$ = {damage relation between a crop and a disease,
damage relation between a crop and a pest, ...}.
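Definitions 1 to 3 can be mirrored by a small data structure. The sketch below is illustrative only: the class `Entity` and the function `relations` are hypothetical names, and the entity list reproduces the example of the text unit starting at "Ble". One assumption is labeled in the code: $E_{11}$ ("soja") is typed as a crop, although the set $C_1$ given in the text does not list it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    index: int   # rank of the entity in the text unit (E1, E2, ...)
    text: str
    etype: str   # an element of C_E, here "crop" or "disease"


def relations(unit: List[Entity], first: int, target_type: str) -> List[Tuple[int, int]]:
    """Index pairs (i, j) of the relations R_ij from E_first to entities of target_type."""
    return [(first, e.index) for e in unit
            if e.etype == target_type and e.index != first]


labels = ["Ble", "ble", "orge de printemps", "pietin echaudage", "cereale",
          "champignon", "maladie", "mais", "ray-grass", "luzerne", "soja",
          "champignon", "maladie"]
disease_ranks = {4, 6, 7, 12, 13}          # the set C2 of the example
# Assumption: every other entity, including E11, is typed as a crop.
unit = [Entity(i, text, "disease" if i in disease_ranks else "crop")
        for i, text in enumerate(labels, start=1)]

pairs = relations(unit, first=1, target_type="disease")
```

Here `pairs` contains the index pairs (1, 4), (1, 6), (1, 7), (1, 12) and (1, 13), i.e. the set $R$ of the example.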

3.2 Datasets 

We call the first dataset ROMEO. It consists of the 5 acts (files) of
W. Shakespeare's classic play Romeo and Juliet, in English. For this
dataset, a user's goal could be to follow the relationships between
characters across the scenes. We summarize the types of concepts and
relations by these sets:

$C_E$ = {character}

$C_R$ = {character talking about or with another character}

The number of instances of $C_E$ is 22 for the ROMEO dataset. The number
of words is 26,551. The number of distinct words is 5,846.

We call the second dataset BSV. This dataset is a collection of scanned
and digital newsletters written in French. The newsletter has been
published since 1946, but the French National Library (Bibliotheque
Mitterrand, or BNF) shares most issues only from 1963 onwards. It is
written weekly in each French region to inform about damage to local
crops. We work with a sample of 2,323 files, but the dataset under
construction should contain about 60,000 files in the range 1963-2015.
Each file contains between 1 and 10 pages, 3 on average. For this dataset,
8 concepts for which we can design dictionaries have been identified
manually with experts. Other kinds of information about the context of a
crop and its relationships are also of interest, such as "developmental
stage of a crop", "number of a newsletter", "degree of damage", "climate".
We summarize the types of concepts and relations by the following sets:
$C_E$ = {crops, diseases, pests, auxiliaries, region, towns, chemicals, date,
developmental stage of a pest, developmental stage of a crop,
number of a newsletter, degree of damage, climate}

$C_R$ = {damage relation between a crop and a disease,
damage relation between a crop and a pest,
intensity of damage with relation between a crop and a disease,
intensity of damage with relation between a crop and a pest}

Figure 1: Manual annotations in a document from the BSV dataset (in yellow:
crops, in green: developmental stages of crops, in brown: diseases, in red:
location, in blue: pests, in purple: auxiliaries, in dark blue: time).

An auxiliary is an insect (like a pest) that is not aggressive to the crop
where it lives. Sometimes it can help control pests. In Figure 1 we can
see that useful information can be retrieved. Among it, crop-disease and
crop-pest relationships are also of interest. These are relations denoting
aggression. As we see in the document, these relationships are not
expressed with verbs or other linguistic patterns.

To enable evaluation, we built an annotated dataset of 37 files. To give
an idea of the cost of the annotation process, 1000 documents would
require 5 months for one person. And as one expert is not sufficient, an
additional check by an external expert would need more time. In Figure 2
we see at top-left a sample of manual annotation for a file, with noun
phrases denoting concepts (in blue) and relations between these concepts
(in red). Concepts are cited as a list but a relation occurs only inline.
At top-right we see an example of the annotation used for the CoNLL
conference challenges on named entity recognition, and at the bottom the
CoNLL annotation format for relation extraction evaluation.

Figure 2: Manual annotations of a document from the BSV dataset for
evaluation (top-left: csv-like format (in red: entities, in blue:
relationships), top-right: BIO/BILOU format for named entities, bottom:
BIO/BILOU format for named entity relationships).


3.3 Dictionary Matching 

From several concepts denoting named entities about locations, persons or
biological entities, like characters, crops, diseases or regions, we are
able to define lexical noun phrases and a list of entries, each associated
with a set of noun phrases. These noun phrases can occur anywhere in the
dataset. The format we have adopted can describe a hierarchy of entries
and the lexical variants of each entry. The storage format is a csv-like
format. Each line describes a unique entry, followed by a label N (node)
or L (leaf) depending on whether the entry describes a category or a
simple concept. For instance, among crops, wheat is a species and is
defined as a node; but durum wheat, buckwheat and soft wheat are defined
as leaves because they are varieties linked to wheat. After the label N or
L, all following noun phrases are considered equivalent and could be found
in a document. Hence the dictionary collects all relevant entries of a
concept, describing the hypernym and synonym relations between lexical
phrases.

For instance, in Figure 3 we can see a sample of the crop dictionary in
French ("ble" is wheat, "ble dur" is durum wheat, "ble noir" is buckwheat,
"ble tendre" is soft wheat). Figure 4 describes the lexical population of
each dictionary for the BSV dataset. For the ROMEO dataset we can find 22
characters.


ble:N:ble:BLE:bles:Triticum:ble dur:ble tendre: 

ble dur:L:BLE DUR:T. durum:Triticum durum:bles durs:bles durs:ble dur:

ble noir:L:BLE NOIR:f. esculentum:fagopyrum esculentum:sarrasin:bles noirs:bles noirs:ble noir:sarrasins:

ble tendre:L:BLE TENDRE:T. aestivum:Triticum aestivum:ble froment:bles froments:ble froments:ble tendre:bles tendres:bles tendres:


Figure 3: Sample of Crops dictionary. 
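The csv-like dictionary format described above (entry, label N or L, then equivalent noun phrases, all separated by colons) is simple to load. The following sketch is not the x.ent loader; `parse_dictionary` is a hypothetical helper, and lower-casing the variants is an assumption made here for matching convenience.

```python
from collections import defaultdict


def parse_dictionary(lines):
    """Return (kinds, lookup): entry -> 'N'/'L', and variant -> list of entries."""
    kinds = {}
    lookup = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(":")]
        fields = [f for f in fields if f]          # drop the empty trailing field
        if len(fields) < 2 or fields[1] not in ("N", "L"):
            continue                               # skip malformed lines
        entry, kind, variants = fields[0], fields[1], fields[2:]
        kinds[entry] = kind
        for variant in [entry] + variants:
            key = variant.lower()                  # assumption: case-insensitive lookup
            if entry not in lookup[key]:
                lookup[key].append(entry)
    return kinds, dict(lookup)


sample = [
    "ble:N:ble:BLE:bles:Triticum:ble dur:ble tendre:",
    "ble dur:L:BLE DUR:T. durum:Triticum durum:bles durs:ble dur:",
    "ble noir:L:BLE NOIR:f. esculentum:fagopyrum esculentum:sarrasin:"
    "bles noirs:ble noir:sarrasins:",
]
kinds, lookup = parse_dictionary(sample)
```

In this sample, `lookup["sarrasin"]` resolves to the leaf entry "ble noir", while `lookup["ble dur"]` resolves both to the node "ble" and to the leaf "ble dur", which reflects the hypernym link encoded in the node line.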


Entity types   auxiliaries   crops   pests   diseases   chemicals   regions   towns
# entries               28     114     373        275        4968        26   33161
# leafs                 28     103     334        241        4968        26   33161
# concepts               0      18      53         40           0         0       0
# lexems               107     727    2673       1846        4968       869   89603

Figure 4: Number of entries in dictionaries. 

3.4 Hand-crafted rule 

We used the Unitex tool (see [?]) to implement hand-crafted rules that
detect some instances of named entities (dates) but also contextual
entities (number of a newsletter, developmental stage of a crop, intensity
of damage on a crop). Graph editing makes it possible to stack and
encapsulate different FSMs into a more global FSM. One advantage is that
the longest matching sequence is given priority, which avoids inclusion
problems and hence removes the need to order rule execution. Recognition
of a date expression such as "15 janvier 1992" or "10-2012" can be
executed with the FSM shown by the graph in Figure 5. We can see that a
non-linear formulation enables the unification of different schemes.

Figure 5: Local grammar (FSM) for date extraction.
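The date grammar is a Unitex graph, which is not reproduced in the text. As a rough stand-in, the two date schemes quoted above can be unified in a single regular expression with two branches. This is a sketch, not the Unitex grammar: the names `MONTHS`, `DATE` and `find_dates` are hypothetical, and month names are written without accents, as they appear in the quoted examples.

```python
import re

# Assumption: unaccented French month names, as in the quoted examples.
MONTHS = ("janvier|fevrier|mars|avril|mai|juin|juillet|aout|"
          "septembre|octobre|novembre|decembre")

DATE = re.compile(
    rf"\b(?:(?P<day>\d{{1,2}})\s+(?P<month>{MONTHS})\s+(?P<year>\d{{4}})"
    rf"|(?P<num_month>\d{{1,2}})-(?P<num_year>\d{{4}}))\b",
    re.IGNORECASE,
)


def find_dates(text):
    """Return every date expression matched by either branch of the pattern."""
    return [m.group(0) for m in DATE.finditer(text)]
```

The alternation plays the role of the non-linear formulation mentioned above: one pattern covers both the "day month year" scheme and the "month-year" scheme.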

More complex, encapsulated graphs enable the detection of information
about damage assessment (risk, prevalence or severity). In Figure 6 we see
a subgraph which can detect sequences containing a numerical expression
about risk. This subgraph can detect an expression such as "infestations
sont limitees a 0,27 larves par pied environ 1 parcelle sur 5 avait
atteint 1 grosse altise en moyenne" (infestations are limited to 0.27
larvae per plant, about 1 plot in 5 had reached 1 large flea beetle on
average). It is included in a global graph with 10 subgraphs.



Figure 6: Local grammar (FSM) and hierarchy of FSMs for damage extraction.

3.5 Architecture Document Heuristics 

The organization of a document (titles, subtitles, references, sections,
headers, tables, pictures, summary, introduction, discussion) can
influence the way extraction is performed. We call this organization the
architecture of a document. Of course many architectures exist and the set
of heuristics is not limited. We propose three heuristics that can help us
to be more accurate in relationship extraction, which we test with the BSV
and ROMEO datasets.

Heuristic 1 Main entity

A target entity occurs in a specific title or subtitle (beginning of a
paragraph or a section).

In Figure 1 we see that the main entity occurs in the title of each
section. This fits heuristic 1.

Heuristic 2 Header

Different entities occur in the header of the document (first lines).

In Figure 1 we see that instances of the entity types region, issue and
date occur in the header. This fits heuristic 2.

Heuristic 3 Avoid section

Some paragraphs beginning with a specific title can contain entities that
are not associated with a main entity or with contextual information.

In Figure 7 we see that a section begins with "Raisonner la lutte contre"
("Reasoning control against"). If we do not exclude this section from the
analysis we get a "crop/pest" relationship instance, cereals/wireworms,
that is false. This fits heuristic 3.
Figure 7: Neighbourhood of a section with a main entity and an avoid section
for the BSV dataset.

3.6 Unsupervised learning 

We used a classical unsupervised learning approach called co-occurrence
analysis. Three families of co-occurrence can be implemented.

Definition 4 Entity position

Let $E_i$ be a target entity. A document is split into a set of text units
(TU). A TU can be a section, a sentence or a paragraph. Let $P^i_w$ be the
position, in terms of words, and $P^i_{tu}$ the text unit, of the head
word of $E_i$ in the document. We define a window by $W_L$, i.e. the
number of words to the left of $P^i_w$, and $W_R$, the number of words to
the right of $P^i_w$. $W_R$, respectively $W_L$, can be $\infty$ if we
consider the right, resp. left, context up to the end, resp. the
beginning, of the document.

Type 1 Text Unit Cooccurrence Let $E_i$ be a target entity, and $E_j$
another entity. We define the cooccurrence $cooc(E_i, E_j)$ as a binary
function such that:

$$cooc(E_i, E_j) = \begin{cases} 1 & \text{if } P^i_w \in P^i_{tu} \text{ and } P^j_w \in P^i_{tu} \text{ and heuristics 1, 2 and 3 are satisfied} \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$

Type 2 Window Cooccurrence The same as type 1, but now:

$$cooc(E_i, E_j) = 1 \text{ if } (P^i_w - W_L) < P^j_w < (P^i_w + W_R) \quad (2)$$

Type 3 Constrained Cooccurrence The same as type 1 or type 2, but now,
given a list of markers $m_k$, at least one marker $m_k$ needs to be
located between $E_i$ and $E_j$ (with $P^k_w$ the position of marker
$m_k$), so:

$$cooc(E_i, E_j) = 1 \text{ if } |P^i_w - P^k_w| < |P^i_w - P^j_w| \quad (3)$$
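The three cooccurrence types can be written as small predicates over word positions. This is a minimal sketch under stated assumptions: the function names are hypothetical, a text unit is represented by its first and last word positions, the document-architecture heuristics of type 1 are not modeled, and type 3 applies the marker condition on top of a type 1 or type 2 result passed in as `base`.

```python
import math


def cooc_text_unit(pos_i, pos_j, unit_start, unit_end):
    """Type 1: both head-word positions fall in the same text unit."""
    return int(unit_start <= pos_i <= unit_end and unit_start <= pos_j <= unit_end)


def cooc_window(pos_i, pos_j, w_left, w_right):
    """Type 2: E_j lies strictly inside the window around E_i."""
    return int(pos_i - w_left < pos_j < pos_i + w_right)


def cooc_constrained(pos_i, pos_j, marker_positions, base):
    """Type 3: base cooccurrence plus one marker closer to E_i than E_j is."""
    has_marker = any(abs(pos_i - p) < abs(pos_i - pos_j) for p in marker_positions)
    return int(bool(base) and has_marker)


# An unbounded right context is expressed with math.inf.
until_end = cooc_window(50, 5000, 10, math.inf)
```

Passing `math.inf` as a window size covers the case where the context extends to the beginning or the end of the document.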


3.7 Step 1: Named Entity Recognition 

Assumption 2 implies that named entities are explicitly written in the
text as tokens. Assumption 3 highlights the importance of extracting only
the named entities that are useful for a specific usage. In that sense
extraction is dictionary-driven for better relevance (see Algorithm 1).
Algorithm 1 named entity extraction algorithm.

Require: dictionaries and grammars.

Ensure: data_entities.

read all dictionaries (with nodes and leafs) and grammars
for all doc in corpus do
    for all dic in dictionaries do
        if words match in dic then
            push data_entities, words
        end if
    end for
    for all gra in grammars do
        if words match in gra then
            push data_entities, words
        end if
    end for
end for

check data_entities included in other words
sort_asc data_entities
for i = 0; i < len(data_entities) - 1; i++ do
    for j = len(data_entities) - 1; j > i; j-- do
        if data_entities[i] exists in data_entities[j] then
            if position of data_entities[i] in the document is the
               same as the position of data_entities[j] in the document then
                remove data_entities[j]
            end if
        end if
    end for
    extract data according to the entities
end for
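A dictionary-driven extraction in the spirit of Algorithm 1 can be sketched as follows. This is not the x.ent implementation: the function `find_entities`, the `Match` tuple and the nested dictionary layout are hypothetical, grammars are left out, and overlapping candidates are resolved by keeping the longest match. That last choice is an assumption borrowed from the longest-match behaviour described for Unitex in Section 3.4; the removal rule of Algorithm 1 may differ. The test sentence is made up for illustration.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int
    end: int
    concept: str
    entry: str


def find_entities(text, dictionaries):
    """dictionaries: {concept: {entry: [variant, ...]}} (hypothetical layout)."""
    candidates = []
    for concept, entries in dictionaries.items():
        for entry, variants in entries.items():
            for variant in variants:
                # Match the variant as a whole phrase, ignoring case.
                pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
                for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                    candidates.append(Match(m.start(), m.end(), concept, entry))
    # Assumption: longest candidates win, so nested shorter matches are dropped.
    candidates.sort(key=lambda c: (-(c.end - c.start), c.start))
    kept = []
    for cand in candidates:
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept)


crops = {"crop": {"ble": ["ble", "bles"],
                  "ble dur": ["ble dur", "bles durs"],
                  "ble tendre": ["ble tendre", "bles tendres"]}}
found = find_entities("Le ble dur et le ble tendre sont semes; le ble leve.", crops)
```

In this example the occurrences of "ble" nested inside "ble dur" and "ble tendre" are discarded, and only the standalone "ble" is reported as the node entry.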


For some entities we cannot build dictionaries; for these we propose to
build grammars that extract the contents from the corpus, for instance
pest significance, developmental stage of a crop, climate, location. We
have integrated the Unitex tool for building grammars. In some cases a
grammar is also more appropriate than a dictionary alone: for location, we
can use a dictionary to store regions, cities or even towns, but some
region names combine with direction words such as "north", "south",
"west", etc. In such cases a grammar is more flexible and increases
accuracy. Here are some grammatical rules:
• <words in the dictionary>

• <keyword 1>...<end of sentence or punctuation>

• <words in the dictionary>...<keyword 1>

• <keyword 1>...<words in the dictionary>

For example, to retrieve a developmental stage of crops such as "Stades:
90 a 100% de couverture du sol terres colorees", the grammar is as shown
in Figure 8:



Figure 8: Local grammar (FSM) for developmental stage. 

This diagram shows that the grammar first looks for a phrase containing
the word "stade" or "stades" followed by a colon; a loop then goes through
all the words of the sentence until a full stop or a semicolon is met. The
two tags <STA> and </STA> mark the result.
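The behaviour described for the developmental-stage grammar can be approximated with a regular expression. The sketch below is a stand-in, not the Unitex graph: `STAGE`, `extract_stages` and `tag_stages` are hypothetical names, and the pattern assumes that a stage expression runs from the colon to the next full stop or semicolon.

```python
import re

# "stade" or "stades", a colon, then everything up to the next "." or ";".
STAGE = re.compile(r"\b(stades?)\s*:\s*([^.;]+)", re.IGNORECASE)


def extract_stages(text):
    """Return the stage expressions found after 'stade(s):'."""
    return [m.group(2).strip() for m in STAGE.finditer(text)]


def tag_stages(text):
    """Wrap each stage expression in <STA>...</STA> tags."""
    return STAGE.sub(
        lambda m: f"{m.group(1)}: <STA>{m.group(2).strip()}</STA>", text)
```

A limitation of this approximation is that a full stop inside the expression (for instance in an abbreviation) ends the match early, which a graph-based grammar could handle more precisely.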

3.8 Step 2: relation extraction 

Relation extraction takes as input the item-sets exported by Algorithm 1
and relies on the three heuristics. Heuristic 1 states that some entities
are main entities (i.e. a category is chosen as a target) and we seek
relations for these entities. Heuristic 2 states that target entities are
declared in header sections (titles, subtitles), and heuristic 3 declares
that some sections are non-relevant; we call them avoid sections, and they
can be specified by a beginning phrase and end either at the end of the
document or at another phrase. Algorithm 2 describes how relation
extraction is implemented. x.ent also implements a class of algorithms to
detect relations without heuristics, for the case where a document only
consists of paragraphs (e.g. a tweet, an email or a news item).
Algorithm 2 relation extraction algorithm.

Require: data_entities of the document.

Ensure: relations.

for all line in this document do
    let paragraphs = analyze structure doc {this is a step in analyzing
    document structure or cooccurrence in Definition 4}
end for
for all para in paragraphs do
    if exists Ei and Ej in this para then
        push this relation
    end if
end for
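A possible reading of Algorithm 2 combined with the main-entity and avoid-section heuristics is sketched below. It is an illustration under assumptions, not the x.ent code: `extract_relations` is a hypothetical function, the input is a list of lines already annotated with (entity, type) pairs, a section is assumed to open when a line starts with a main entity, and an avoid section is assumed to last until the next section title. The sample document is invented and uses the English gloss of the avoid phrase.

```python
def extract_relations(lines, main_type="crop", target_types=("disease", "pest"),
                      avoid_prefixes=("Raisonner la lutte contre",)):
    """lines: list of (text, [(entity, type), ...]) in document order."""
    relations, main, avoiding = [], None, False
    for text, entities in lines:
        stripped = text.strip()
        titles = [e for e, t in entities
                  if t == main_type and stripped.lower().startswith(e.lower())]
        if titles:                                   # a main entity opens a section
            main, avoiding = titles[0], False
        elif stripped.startswith(avoid_prefixes):    # an avoid section starts
            avoiding = True
        elif main is not None and not avoiding:
            relations += [(main, e) for e, t in entities if t in target_types]
    return relations


# Invented sample document.
doc = [
    ("Cereals", [("cereals", "crop")]),
    ("Aphids observed on ears", [("aphids", "pest")]),
    ("Reasoning control against wireworms", [("wireworms", "pest")]),
    ("Wireworms are present in some plots", [("wireworms", "pest")]),
    ("Rapeseed", [("rapeseed", "crop")]),
    ("Downy mildew reported on rapeseed",
     [("downy mildew", "disease"), ("rapeseed", "crop")]),
]
found_relations = extract_relations(
    doc, avoid_prefixes=("Reasoning control against",))
```

With the avoid section excluded, the false cereals/wireworms pair discussed in Section 3.5 is not produced, while the cereals/aphids and rapeseed/downy mildew pairs are kept.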


3.9 Step 3: contextual information assignment 

Some pieces of information are considered as categories of named entities
because they describe a pattern of reality, but they do not denote a
specific object, e.g. crop damage, developmental stage, climate. Often
these categories cannot be captured in a dictionary but only with
hand-crafted rules; they are nevertheless detected at the same time as the
others, and independently (see Section 3.7).


Nevertheless they can describe more precisely the context of a
relationship between two entities. That is why some relationships are not
only binary but n-ary (for instance crop-disease-damage). Damage here
describes the magnitude of the relationship.


In Figure 9 we have applied an algorithm that analyzes the structure of
paragraphs containing the entities "crop", "disease" and "damage" to find
the crop-disease-damage relationship. Algorithm 1 found a crop value,
"Colza", at the beginning of a line; this value is used to delimit a
paragraph from this position to the position of the next crop entity, or
to the end of the document if no other crop entity occurs at the start of
a line. Within these paragraphs, we continue to analyze the structure
according to the disease entity; in this case "Charancon de la tige du
colza" opens a segment on its own line. Finally, we find all the values of
the damage entity in the segments that contain values of the disease
entity. In this example, we find the relationship "Colza:Charancon de la
tige du colza:nuisibilite est elevee" but not the relation "Colza:mouche
du chou:nuisibilite est elevee".
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[COLZAP -- 

STADE : reprise a decollement de la tige 


Un decollement de la tige (stade C2) est note dans 5 parcelles du reseau. Sur les autres 
parcelles, la reprise de vegetation (stade Cl - reverdissement au niveau des coeurs - atteint) est 
maintenant effective, avec un nombre de plantes concernees encore limite sur seulement 2 
parcelles. 

Dans certaines parcelles, on trouve des pieds de colzas violaces qui sont mal enracines : il s'agit 
de degats de mouche du chou et/ou de gel mecanique sur de petits colzas (plus marques en sol 
calcaire). On note localement des degats de pigeons . 


_ Desease 

[CHARANCON DE LA TIGE DU COLZA 


ConnaTtre le charan?on de la tige du colza. — Damage 

Sa [nuisibilite est eleveel L'adulte n'est pas directement nuisible; c'est I'introduction des oeufs 
dans la tige qui provoque une reaction et conduit a la deformation, voire a I'eclatement des 
tiges. Les pertes de rendement sont aggravees en cas de stress hydrique ou d'attaques de 
meligethes . Etant donne la nuisibilite potentielle de cet insecte, il est considere que sa seule 
presence sur les parcelles est un risque. Le vol debute des que la temperature de I'air depasse 
9°C mais ne se generalise que si les temperatures sont superieures a 12-13°C. Si les 
temperatures redeviennent defavorables, les charancons retournent s'abriter dans le sol mais 
restent actifs si la temperature est superieure a 6°C. 


Figure 9: An example for analyzing a relationship of three entities ’’crop- 
disease-damage” . 


Document
|-- entity1_a          (paragraph)
|   |-- entity2_a'     (segment)
|   `-- entity2_b'     (segment)
`-- entity1_b          (paragraph)
    |-- entity2_a''    (segment)
    `-- entity2_b''    (segment)

In each segment: find all values of entity3

relation extraction: entity1_a:entity2_a':list_of_entity3

Figure 10: Analysis of a document to find a relationship between three entities.

Algorithm 3 Transformer data.

Require: document, entity_tag
Ensure: paragraphs.
mark_curr ← 0
mark_prev ← 1
phases ← null
for all line in document do
    if data of entity_tag is at the start of the line or upper case first
    letter in line then
        mark_curr ← mark_curr + 1
        if mark_curr > mark_prev then
            push paragraphs, phases
            phases ← null
        end if
    end if
    phases ← phases + line
end for
if length(phases) > 0 then
    push paragraphs, phases {push the final paragraph.}
end if
return paragraphs

Algorithm 4 Contextual Information Extraction.

Require: document, entity_tags
Ensure: relations of contextual information
if length(entity_tags) = 3 then
    {check a relationship of three entities}
    paragraphs1 = TransformerData(document, entity_tags[0])
    for all para1 in paragraphs1 do
        paragraphs2 = TransformerData(para1, entity_tags[1])
        if length(paragraphs2) ≠ 0 then
            for all para2 in paragraphs2 do
                if para2 contains values of entity_tags[2] then
                    get values of entity_tags[2] in para2
                    get value of entity_tags[1] in para2
                    get value of entity_tags[0] in para1
                    push relations, values of entities tags
                end if
            end for
        end if
    end for
end if
return relations
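
The nested segmentation of Algorithms 3 and 4 can be sketched in Python as follows. This is a simplified illustration, not the x.ent implementation: the helper names `split_on_entity` and `extract_ternary` are hypothetical, a segment is opened only when a line starts with a known entity value (case-insensitive), and the "upper case first letter" condition of Algorithm 3 is not reproduced.

```python
def split_on_entity(lines, values):
    """Group lines into segments, opening a new one at each line that starts with an entity value."""
    segments = []  # list of (entity_value, [lines of the segment])
    current = None
    for line in lines:
        head = line.strip().lower()
        match = next((v for v in values if head.startswith(v.lower())), None)
        if match is not None:
            current = (match, [line])
            segments.append(current)
        elif current is not None:
            current[1].append(line)
    return segments


def extract_ternary(lines, crops, diseases, damages):
    """Segment by crop, then by disease, then collect damage values inside each disease segment."""
    relations = []
    for crop, crop_lines in split_on_entity(lines, crops):
        for disease, segment_lines in split_on_entity(crop_lines, diseases):
            text = " ".join(segment_lines).lower()
            found = [d for d in damages if d.lower() in text]
            if found:
                relations.append((crop, disease, found))
    return relations


# Shortened version of the bulletin shown in Figure 9.
document = [
    "COLZA",
    "Il s'agit de dégâts de mouche du chou sur de petits colzas.",
    "CHARANÇON DE LA TIGE DU COLZA",
    "Sa nuisibilité est élevée.",
]
relations = extract_ternary(
    document,
    crops=["colza"],
    diseases=["charançon de la tige du colza", "mouche du chou"],
    damages=["nuisibilité est élevée"],
)
print(relations)
```

Because "mouche du chou" occurs in the middle of a line rather than at its start, it does not open a segment and is therefore not linked to the damage value, which mirrors the behaviour described for Figure 9.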


4 Results 


4.1 Evaluation about extractions 

We proceed to a double evaluation process:

Firstly, we compare the named entities exported by x.ent with those produced
by well-known approaches: exact dictionary matching and a MaxEnt approach,
with respectively the LingPipe tool and the SNER tool. Table 11 shows the
results for crop, disease and pest name extraction. Standard measures to
assess the accuracy of a system rely on known pieces of information that we
aim to extract from a test dataset. The three parameters for assessment are
the following: F-score (Equation 6), recall (Equation 5) and precision
(Equation 4). x.ent produces scores as good as those obtained by LingPipe.
LingPipe also offers a machine learning approach based on hidden Markov
models, but it gives weaker results.

Secondly, we compare relation extraction by x.ent with that of SNER and of
a cooccurrence approach with different window parameters. SNER uses
parse-tree analysis, and French was the language considered to process the
BSV dataset. Table 12 shows that x.ent captures more correct relations than
the other state-of-the-art approaches, with an F-score of about 55%, whereas
SNER yields 38% and the window-based cooccurrence approach 42% (see
Figure 13 for the variation of the F-score according to the window size).


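\[ 0 \le P \le 1, \qquad P = \frac{\#\text{correct\_answers}}{\#\text{produced\_answers}} \tag{4} \]

\[ 0 \le R \le 1, \qquad R = \frac{\#\text{correct\_answers}}{\#\text{possible\_correct\_answers}} \tag{5} \]

\[ 0 \le F_1 \le 1, \qquad F_1 = \frac{(\beta^2 + 1)\, P\, R}{\beta^2 R + P} \tag{6} \]

For Equation 6, usually \(\beta = 1\).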
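
As a worked illustration of Equations 4 to 6 in the usual case where beta equals 1, the following Python sketch computes the three measures from answer counts. The function name and the counts are hypothetical and serve only to show the arithmetic.

```python
def precision_recall_f1(correct, produced, possible):
    """Return (P, R, F1) from answer counts, following Equations 4 to 6 with beta = 1."""
    p = correct / produced if produced else 0.0
    r = correct / possible if possible else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Hypothetical counts, for illustration only: 30 correct answers among
# 40 produced answers, out of 50 possible correct answers.
p, r, f1 = precision_recall_f1(correct=30, produced=40, possible=50)
print(round(p, 2), round(r, 2), round(f1, 3))
```

With beta equal to 1 the F-score reduces to the harmonic mean of precision and recall, so it is always pulled toward the lower of the two values.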



        X.ENT                    SNER                     LINGPIPE
        P       R       F1       P       R       F1       P       R       F1
BIO     96.46   95.52   95.98    92.66   71.41   80.52    96.45   95.53   95.99
MAL     96.97   95.53   96.24    95.46   77.38   85.38    96.97   95.52   96.24
PLA     88.80   98.67   93.47    93.99   82.68   87.94    88.80   98.67   93.47
REG     100     100     100      93.20   73.73   81.92    100     100     100
TOT     94.33   96.67   95.48    93.68   76.85   84.41    94.34   96.65   95.48


Figure 11: Evaluation of named entity recognition.



            X.ENT                  COOCCURRENCE
            P      R      F1       P      R      F1
PLA-BIO     53.4   75.8   52.7     36.4   50.5   42.3
PLA-MAL     58.1   69.5   63.3     41.3   38.7   40.0
TOT         55.3   73.1   62.9     38.1   45.4   41.4


Figure 12: Evaluation of entity relationship recognition.

Figure 13: F1 score in the parameter space (left and right window) of the
window-based cooccurrence approach.
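
The window-based cooccurrence baseline, with its left and right window parameters, can be sketched in Python as follows. This is an assumption-laden illustration rather than the evaluated implementation: the function name is hypothetical, entities are single tokens, and tokenization is a plain whitespace split.

```python
def window_cooccurrences(tokens, targets, others, left=5, right=5):
    """Pair each target token with entity tokens found up to `left` tokens before or `right` tokens after it."""
    pairs = set()
    for i, token in enumerate(tokens):
        if token not in targets:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        for other in window:
            if other in others:
                pairs.add((token, other))
    return pairs


tokens = "le colza est touché par le mildiou et le blé par la rouille".split()
targets = {"colza", "blé"}
others = {"mildiou", "rouille"}

narrow = window_cooccurrences(tokens, targets, others, left=2, right=5)
wide = window_cooccurrences(tokens, targets, others, left=3, right=5)
print(sorted(narrow))
print(sorted(wide))
```

Widening the left window by a single token adds the pair ("blé", "mildiou"), which the sentence does not express. This sensitivity to the window size is the kind of variation that a parameter-space plot makes visible.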

4.2 Visualisation 

The x.ent tool has been developed with Perl modules for the parsing
function, but it is encapsulated in an R package available on the R platform
(see [?]). The package also offers R functions to explore the results of
extraction: parallel coordinates, histogram, Venn diagram, stacked bar graph
and statistical tests on pairwise relations.

Figure 14 shows an example of parallel coordinate visualization between two
sets of entities (e1 and e2). e1 is a target entity for which we seek
relations. For the BSV dataset the target entity is the crop category. In
the example, e2 is a set of entities from different categories ("mouche du
chou" is an instance of the pest category and "mildiou" is an instance of
the disease category).

The R code is the following:

xplot(e1="colza", e2=c("mouche du chou", "mildiou"))

We can add a constraint on time:

xplot(e1="colza", e2=c("mouche du chou", "mildiou"), t=c("09.2010", "02.2011"))

Figure 14: Parallel coordinate display.


Figure 15 shows the distribution over time of a specific relation,
"colza:mildiou". It works with any instance of an entity, but only when a
date entity is extracted from the dataset.

The R code is the following:

xhist("colza:mildiou")


[Plot: "Histogram of bulletin: date"; x-axis: Date]


Figure 15: Histogram over time of a relation.


Figure 16 shows a stacked bar graph representing, for a first set of
entities (in the example the crops "blé", "maïs", "tournesol", "colza"), the
proportion of each instance of the second set, here "mouche du chou" and
"puceron".

The R code is the following:

xprop(c("blé", "maïs", "tournesol", "colza"), c("mouche du chou", "puceron"))

If the first set is the whole set of instances of the target entity category
(here instances of crops) having at least 2 occurrences:

v1 = as.vector(xdata_value("p")$value[xdata_value("p")$freq > 2])
xprop(v1, c("mouche du chou", "puceron"))



[Plot: two stacked bar charts; axes: cat, count; legend: mouche du chou, puceron]


Figure 16: Stacked bar graph.

Figure 17 shows a Venn diagram between a set of instances from the target
category (here "blé", "orge de printemps", "tournesol") and a set of
instances from specified categories (here b and m, denoting respectively
pest and disease).

The R code is the following:

xvenn(v=c("blé", "orge de printemps", "tournesol"), e=c("b", "m"))


[Venn diagram with three sets: blé, orge de printemps, tournesol]


Figure 17: Venn diagram.


Figure 18 shows a comparison between a crop (here "blé") and all possible
instances of another entity category (here instances of the pest category).
Four tests have been implemented for the function: Kolmogorov, Wilcoxon,
Student and GrowthCurves. At the moment no decision function combines the
tests to decide whether or not the p-values agree on a positive similarity.
Figure 19 shows an export in which all p-values saturate at 1. For instance,
"blé:limace des jardins" and "blé:adventice" have the same distribution
across the BSV dataset. It means that "limace des jardins" occurs at the
same time as "adventice" in "blé" crop cultures.

The R code is the following:

xtest("blé", as.vector(xdata_value("p")))
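
To make the idea of comparing the distributions of two relations concrete, here is a minimal Python sketch of the two-sample Kolmogorov-Smirnov statistic, the quantity behind the first of the four tests. It is an independent illustration, not the code used by xtest: it returns only the statistic (no p-value), and the week numbers are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical week numbers at which each relation was observed.
adventice = [10, 11, 12, 15]
limace = [10, 11, 12, 15]
puceron = [20, 21, 22, 23]

same = ks_statistic(adventice, limace)
different = ks_statistic(adventice, puceron)
print(same, different)
```

A statistic of 0 means that the two relations occur at exactly the same times, which is the situation in which the test's p-value is 1. A statistic of 1 means that the two sets of observation times do not overlap at all.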



      relation                                          KOLMOGOROV  WILCOXON  STUDENT  GrowthCurves
700   blé:méligèthe/blé:thrips                          1.00        0.13      0.13     0.02
543   blé:cicadelle/blé:pyrale                          1.00        0.00      0.00     0.02
613   blé:criocère/blé:thrips                           1.00        0.00      0.00     0.02
689   blé:méligèthe/blé:puceron des épis de céréales    0.91        0.00      0.00     0.02


Figure 18: Pairwise relations comparison.


blé:adventice/blé:limace des jardins
blé:adventice/blé:puceron des céréales et du rosier
blé:campagnol des champs/blé:corbeau freux
blé:campagnol des champs/blé:pyrale
blé:campagnol des champs/blé:zabre des céréales
blé:cécidomyie jaune du blé/blé:charançon
blé:cécidomyie jaune du blé/blé:charançon de la tige
blé:cécidomyie jaune du blé/blé:mouche grise des céréales
blé:cécidomyie jaune du blé/blé:noctuelle
blé:cécidomyie jaune du blé/blé:oscinie de l'avoine


Figure 19: Global saturation of all tests.

4.3 Integration 

The x.ent tool has been used to parse the BSV dataset, export the results in
CSV format, and load them into a database management system to be queried
by end-users. In this system relations are the pivotal pieces of information
for information retrieval. We can mention four use cases, each with a main
query and refinements:

• main query is the relation crop-disease, refined by damage and region on
the map.

example: Crop=wheat, Disease=rust, On map: risk assessment,
region=Burgundy

• main query is crop, refined by pest (relation crop-pest) and region on
the map.

example: Crop=rapeseed, On map: Pest=cabbage maggot, region=Centre,
and document sorting by date

• main query is disease, refined by crop (relation crop-disease) and region
on the map.

example: disease=potato late blight, On map: crop=potato,
region=Burgundy

• main query is pest, refined by crop (relation crop-pest) and region on
the map.

example: pest=fly, On map: crop=wheat, region=Midi-Pyrenees
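
The first use case above (a main query on a crop-disease relation, refined by region) can be sketched with Python's standard csv and sqlite3 modules. The CSV layout, the column names, the table name and the sample rows are all hypothetical; the actual export format of x.ent and the schema of the platform are not described in the text.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; column names and rows are assumptions for illustration.
CSV_EXPORT = """crop,disease,damage,region,date
wheat,rust,high,Burgundy,2010-09-14
wheat,rust,low,Centre,2010-09-21
potato,late blight,high,Burgundy,2011-02-01
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation (crop TEXT, disease TEXT, damage TEXT, region TEXT, date TEXT)"
)
rows = [tuple(record.values()) for record in csv.DictReader(io.StringIO(CSV_EXPORT))]
conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?, ?)", rows)


def query(crop, disease, region=None):
    """Main query on a crop-disease relation, optionally refined by region."""
    sql = "SELECT damage, region, date FROM relation WHERE crop = ? AND disease = ?"
    params = [crop, disease]
    if region is not None:
        sql += " AND region = ?"
        params.append(region)
    return conn.execute(sql + " ORDER BY date", params).fetchall()


result = query("wheat", "rust", region="Burgundy")
print(result)
```

Storing one relation per row keeps the main query and each refinement as simple equality filters, so the other three use cases only change which columns are fixed first.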

At present, 4 users who are specialists in potato and wheat benefit from the
database and the platform-as-a-service. More epidemiologists and
agronomists from the integrated crop protection network (réseau PIC -
protection intégrée des cultures), which has 400 subscribers, are
potentially interested in using this web platform. Risk analysis from a
sociological point of view is also possible.

Figure 20: Vespa user interface.


4.4 Availability 

We developed, improved and applied a relation extraction method that we
implemented as an R-project package (x.ent). The package is available from
the CRAN R project server (http://cran.r-project.org/, see Software,
Packages; x.ent version 1.0.6), and is downloadable from the R graphical
user interface (required R libraries: xtable (see [16]), jsonlite (see
[50]), venneuler (see [?]), ggplot2 (see [61]), stringr (see [60]), opencpu
(see [49]) and rJava (see [15])).

The results of BSV corpus processing have been stored in a relational
database with a web front-end. The temporary website
http://vespa.cortext.net displays the front-end interface, in which a user
can query the results to retrieve relevant documents.

4.5 Discussion 

Information extraction is not a new field, but new opportunities with new
usages and new corpora bring this kind of task to the fore. Even if the
number of documents in a corpus is not huge (several tens of thousands to
several millions), the number of possible relations has no limit.
Extracting good and relevant relations, storing all relations, and querying
relations in a concrete usage context can all be challenges.

Many factors can influence relation extraction. We explore the capacity of
syntactic expression in documents to extract relations. Indirect relations
are also possible, as in genetics when a geneticist states that if geneA
interacts with geneB and geneB interacts with geneC, then geneA can be in
interaction with geneC, or that if geneA interacts with geneC in one
species, then it is also a putative relation in another species. In our BSV
dataset we do not use any inference protocol. We try to take into account
all possible signs in a document. Hence our approach goes further than a
linguistic approach, the aim of which is to analyze the structure of each
sentence. Our point of view is equivalent to argumentative analysis, where
parts of speech are linked by sections. From this point of view we show
that a specific concept of interest can sometimes be situated in a special
location in a document, as with speaking characters in a theater text or
titles in a newsletter.


5 Conclusion 

We developed, improved and applied a relation extraction method available
as an R package. The tool has been integrated into an information system
called Vespa Mining with end-users (agronomists and epidemiologists).
The extraction task requires the user to design a proto-ontology of their
domain with a set of categories. Each category is given meaning by instances
(string sequences) that small local grammars and flat dictionaries can match
in documents. A target category is set in order to search for instances of
another category as a relationship, together with contextual information.
The tool relies on two hypotheses: that named entities are extractable and
that document structure helps extraction. We compare with state-of-the-art
tools and show that, while x.ent matches the performance of named entity
recognizers, its relation extraction scores are better. Exploiting document
structure together with unsupervised learning can achieve high extraction
scores.

We used two datasets: a literary dataset consisting of a Shakespeare play,
and an agricultural newsletter dataset. The goal for the newsletter was to
learn relations such as crop-disease-damage and crop-pest-damage. On an
evaluation dataset that we designed, we obtain an F-score ~ 55.

Our interest was also to help a user explore the large potential number of
relationships. Two means were implemented in that direction. Firstly,
information visualization capacities such as parallel coordinates,
histogram, Venn diagram, stacked bar graph and statistical tests on pairwise
relations. Secondly, an integration of the tool in a user-friendly platform
with concrete real-world information. Here the user can browse the dataset
through relationships and complementary information (locations, damage
magnitudes, or simple keywords), with geolocalisation and links back to the
original documents.